Abstract

We study the problem of learning local metrics for nearest neighbor classification.
Most previous works on local metric learning learn a number of local unrelated
metrics. While this "independence" approach delivers an increased flexibility, its
downside is the considerable risk of overfitting. We present a new parametric local
metric learning method in which we learn a smooth metric matrix function over the
data manifold. Using an approximation error bound of the metric matrix function
we learn local metrics as linear combinations of basis metrics defined on anchor
points over different regions of the instance space. We constrain the metric matrix
function by imposing on the linear combinations manifold regularization which
makes the learned metric matrix function vary smoothly along the geodesics of
the data manifold. Our metric learning method has excellent performance both
in terms of predictive power and scalability. We experimented with several
large-scale classification problems, tens of thousands of instances, and compared it with
several state of the art metric learning methods, both global and local, as well as to
SVM with automatic kernel selection, all of which it outperforms in a significant
manner.



1 Introduction 



The nearest neighbor (NN) classifier is one of the simplest and most classical non-linear classifica- 
tion algorithms. It is guaranteed to yield an error no worse than twice the Bayes error as the number 
of instances approaches infinity. With finite learning instances, its performance strongly depends 
on the use of an appropriate distance measure. Mahalanobis metric learning (U Q21 13 [TOj [T7J [14) 
improves the performance of the NN classifier if used instead of the Euclidean metric. It learns 
a global distance metric which determines the importance of the different input features and their 
correlations. However, since the discriminatory power of the input features might vary between dif- 
ferent neighborhoods, learning a global metric cannot fit well the distance over the data manifold. 
Thus a more appropriate way is to learn a metric on each neighborhood, and local metric
learning [8, 3, 15, 7] does exactly that. It increases the expressive power of standard Mahalanobis metric
learning by learning a number of local metrics (e.g. one per instance).

Local metric learning has been shown to be effective for different learning scenarios. One of the
first local metric learning works, Discriminant Adaptive Nearest Neighbor classification [8], DANN,
learns local metrics by shrinking neighborhoods in directions orthogonal to the local decision
boundaries and enlarging the neighborhoods parallel to the boundaries. It learns the local metrics
independently with no regularization between them, which makes it prone to overfitting. The authors of
LMNN-Multiple Metric (LMNN-MM) [15] significantly limited the number of learned metrics and
constrained all instances in a given region to share the same metric in an effort to combat overfitting.
In the supervised setting they fixed the number of metrics to the number of classes; a similar idea
has also been considered in [?]. However, they too learn the metrics independently for each region,
making them also prone to overfitting since the local metrics will be overly specific to their
respective regions. The authors of [16] learn local metrics using a least-squares approach by minimizing a
weighted sum of the distances of each instance to a priori defined target positions and constraining the
instances in the projected space to preserve the original geometric structure of the data in an effort to
alleviate overfitting. However, the method learns the local metrics using a learning-order-sensitive
propagation strategy, and depends heavily on the appropriate definition of the target positions for
each instance, a task far from obvious. In another effort to overcome the overfitting problem of the
discriminative methods [8, 15], Generative Local Metric Learning, GLML [11], proposes to learn
local metrics by minimizing the NN expected classification error under strong model assumptions.
It uses the Gaussian distribution to model the learning instances of each class. However, such
strong model assumptions may be too rigid for many learning problems.

In this paper we propose the Parametric Local Metric Learning method (PLML) which learns a 
smooth metric matrix function over the data manifold. More precisely, we parametrize the metric 
matrix of each instance as a linear combination of basis metric matrices of a small set of anchor 
points; this parametrization is naturally derived from an error bound on local metric approximation. 
Additionally we incorporate a manifold regularization on the linear combinations, forcing the linear 
combinations to vary smoothly over the data manifold. We develop an efficient two stage algorithm 
that first learns the linear combinations of each instance and then the metric matrices of the anchor 
points. To improve scalability and efficiency we employ a fast first-order optimization algorithm, 
FISTA [2], to learn the linear combinations as well as the basis metrics of the anchor points. We 
experiment with the PLML method on a number of large scale classification problems with tens of 
thousands of learning instances. The experimental results clearly demonstrate that PLML signifi- 
cantly improves the predictive performance over the current state-of-the-art metric learning methods, 
as well as over multi-class SVM with automatic kernel selection. 

2 Preliminaries 

We denote by $X$ the $n \times d$ matrix of learning instances, the $i$-th row of which is the instance $x_i^T \in \mathbb{R}^d$,
and by $y = (y_1, \ldots, y_n)^T$, $y_i \in \{1, \ldots, c\}$, the vector of class labels. The squared Mahalanobis
distance between two instances in the input space is given by:

$$d^2_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$$

where $M$ is a PSD metric matrix ($M \succeq 0$). A linear metric learning method learns a Mahalanobis
metric $M$ by optimizing some cost function under the PSD constraints for $M$ and a set of additional
constraints on the pairwise instance distances. Depending on the actual metric learning method,
different kinds of constraints on pairwise distances are used. The most successful ones are the
large margin triplet constraints. A triplet constraint, denoted by $c(x_i, x_j, x_k)$, indicates that in the
projected space induced by $M$ the distance between $x_i$ and $x_j$ should be smaller than the distance
between $x_i$ and $x_k$.
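
As a small illustration of the two definitions above, the following sketch computes the squared Mahalanobis distance and checks a triplet constraint on toy data. The function names, the toy matrix and the `margin` argument (zero here, so the check is the plain ordering of the two distances) are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def sq_mahalanobis(xi, xj, M):
    """Squared Mahalanobis distance (xi - xj)^T M (xi - xj)."""
    diff = xi - xj
    return float(diff @ M @ diff)

def triplet_satisfied(xi, xj, xk, M, margin=0.0):
    """True if xj is closer to xi than xk is, by more than `margin`, under M."""
    return sq_mahalanobis(xi, xk, M) > sq_mahalanobis(xi, xj, M) + margin

# Toy PSD metric that weights the first feature four times more than the second.
M_toy = np.diag([4.0, 1.0])
xi = np.array([0.0, 0.0])
xj = np.array([0.0, 1.0])   # differs from xi only in the second feature
xk = np.array([1.0, 0.0])   # differs from xi only in the first feature

d_ij = sq_mahalanobis(xi, xj, M_toy)
d_ik = sq_mahalanobis(xi, xk, M_toy)
```

Under the Euclidean metric (identity matrix) the two neighbors are equally far from `xi`; the toy metric separates them because it stretches the first feature.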

Very often a single metric $M$ cannot adequately model the complexity of a given learning problem
in which discriminative features vary between different neighborhoods. To address this limitation,
in local metric learning we learn a set of local metrics. In most cases we learn a local metric for
each learning instance [8, 11]; however, we can also learn a local metric for some part of the instance
space, in which case the number of learned metrics can be considerably smaller than $n$, e.g. [15]. We
follow the former approach and learn one local metric per instance. In principle, distances should
then be defined as geodesic distances using the local metric on a Riemannian manifold. However,
this is computationally difficult, thus we define the distance between instances $x_i$ and $x_j$ as:

$$d^2_{M_i}(x_i, x_j) = (x_i - x_j)^T M_i (x_i - x_j)$$

where $M_i$ is the local metric of instance $x_i$. Note that most often the local metric of instance
$x_i$ is different from that of $x_j$. As a result, the distance $d^2_{M_i}(x_i, x_j)$ does not satisfy the symmetry
property, i.e. it is not a proper metric. Nevertheless, in accordance with standard practice we will
continue to use the term local metric learning, following [15, 11].
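
The loss of symmetry can be seen on a two-dimensional toy case. In the sketch below the two local metrics are hypothetical values chosen only so that each instance looks at a different feature.

```python
import numpy as np

def local_sq_dist(x_from, x_to, M_from):
    """Squared distance from x_from to x_to, measured with the local metric of x_from."""
    diff = x_from - x_to
    return float(diff @ M_from @ diff)

# Hypothetical local metrics: instance a only looks at feature 1, instance b at feature 2.
M_a = np.diag([1.0, 0.0])
M_b = np.diag([0.0, 1.0])
x_a = np.array([0.0, 0.0])
x_b = np.array([2.0, 1.0])

d_a_to_b = local_sq_dist(x_a, x_b, M_a)   # uses the metric of a
d_b_to_a = local_sq_dist(x_b, x_a, M_b)   # uses the metric of b
```

The two values differ (4 versus 1 here), which is why the resulting function is not a metric in the strict sense.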



3 Parametric Local Metric Learning 



We assume that there exists a Lipschitz smooth vector-valued function $f(x)$, the output of which
is the vectorized local metric matrix of instance $x$. Learning the local metric of each instance is
essentially learning the value of this function at different points over the data manifold. In order to
significantly reduce the computational complexity we will approximate the metric function instead
of directly learning it.



Definition 1 A vector-valued function $f(x)$ on $\mathbb{R}^d$ is an $(\alpha, \beta, p)$-Lipschitz smooth
function with respect to a vector norm $\|\cdot\|$ if $\|f(x) - f(x')\| \le \alpha \|x - x'\|$ and
$\|f(x) - f(x') - \nabla f(x')^T (x - x')\| \le \beta \|x - x'\|^{1+p}$, where $\nabla f(x')^T$ is the derivative of
the function $f$ at $x'$. We assume $\alpha, \beta > 0$ and $p \in (0, 1]$.


[18] have shown that any Lipschitz smooth real function $f(x)$ defined on a lower dimensional
manifold can be approximated by a linear combination of function values $f(u)$, $u \in U$, of a set $U$ of
anchor points. Based on this result we have the following lemma that gives the respective error
bound for learning a Lipschitz smooth vector-valued function.



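Lemma 1 Let $(\gamma, U)$ be a nonnegative weighting on anchor points $U$ in $\mathbb{R}^d$. Let $f$ be an
$(\alpha, \beta, p)$-Lipschitz smooth vector function. We have for all $x \in \mathbb{R}^d$:

$$\Big\| f(x) - \sum_{u \in U} \gamma_u(x) f(u) \Big\| \le \alpha \Big\| x - \sum_{u \in U} \gamma_u(x)\, u \Big\| + \beta \sum_{u \in U} \gamma_u(x) \|x - u\|^{1+p} \tag{1}$$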



The proof of Lemma 1 is similar to the proof of Lemma 2.1 in [18]; for lack of space
we omit its presentation. By the nonnegative weighting strategy $(\gamma, U)$, the PSD constraints on the
approximated local metric are automatically satisfied if the local metrics of the anchor points are PSD
matrices.

Lemma 1 suggests a natural way to approximate the local metric function by parameterizing the
metric $M_i$ of each instance $x_i$ as a weighted linear combination, with weights $W_i \in \mathbb{R}^m$, of a small set of
basis metrics, $\{M_{b_1}, \ldots, M_{b_m}\}$, each one associated with an anchor point defined in some region
of the instance space. This parametrization will also provide us with a global way to regularize the
flexibility of the metric function. We will first learn the vector of weights $W_i$ for each instance $x_i$,
and then the basis metric matrices; these two together will give us the metric $M_i$ for the instance $x_i$.

More formally, we define an $m \times d$ matrix $U$ of anchor points, the $i$-th row of which is the anchor
point $u_i$, where $u_i^T \in \mathbb{R}^d$. We denote by $M_{b_i}$ the Mahalanobis metric matrix associated with $u_i$.
The anchor points can be defined using some clustering algorithm; we have chosen to define them
as the means of clusters constructed by the $k$-means algorithm. The local metric $M_i$ of an instance
$x_i$ is parametrized by:

$$M_i = \sum_{b_k} W_{ib_k} M_{b_k}, \qquad W_{ib_k} \ge 0, \qquad \sum_{b_k} W_{ib_k} = 1 \tag{2}$$



where $W$ is an $n \times m$ weight matrix, and its $W_{ib_k}$ entry is the weight of the basis metric $M_{b_k}$ for
the instance $x_i$. The constraint $\sum_{b_k} W_{ib_k} = 1$ removes the scaling problem between different local
metrics. Using the parametrization of equation (2), the squared distance of $x_i$ to $x_j$ under the metric
$M_i$ is:

$$d^2_{M_i}(x_i, x_j) = \sum_{b_k} W_{ib_k}\, d^2_{M_{b_k}}(x_i, x_j) \tag{3}$$

where $d^2_{M_{b_k}}(x_i, x_j)$ is the squared Mahalanobis distance between $x_i$ and $x_j$ under the basis metric
$M_{b_k}$. We will show in the next section how to learn the weights of the basis metrics for each instance
and in Section 3.2 how to learn the basis metrics.
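
The parametrization in (2) and the identity in (3) can be checked numerically. The sketch below uses randomly generated PSD matrices as stand-ins for learned basis metrics and a hand-picked weight vector on the simplex; both are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4

# Hypothetical PSD basis metrics M_{b_1}, ..., M_{b_m}, built as A A^T.
basis = np.stack([A @ A.T for A in rng.normal(size=(m, d, d))])   # shape (m, d, d)

# Hypothetical weight vector of one instance: nonnegative and summing to one.
w_i = np.array([0.5, 0.3, 0.2, 0.0])

# Equation (2): the local metric is a convex combination of the basis metrics.
M_i = np.tensordot(w_i, basis, axes=1)                             # shape (d, d)

x_i, x_j = rng.normal(size=d), rng.normal(size=d)
diff = x_i - x_j

# Left-hand side of (3): distance under the combined metric.
d_local = diff @ M_i @ diff
# Right-hand side of (3): weighted sum of the distances under each basis metric.
d_per_basis = np.einsum('p,kpq,q->k', diff, basis, diff)
d_mixed = w_i @ d_per_basis
```

Because the weights are nonnegative, `M_i` stays PSD whenever the basis metrics are PSD, which is the property noted after Lemma 1.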

Algorithm 1 Smooth Local Linear Weight Learning

Input: $W^0$, $X$, $U$, $G$, $L$, $\lambda_1$, and $\lambda_2$
Output: matrix $W$

define $\tilde{g}_{\beta,Y}(W) = g(Y) + \mathrm{tr}(\nabla g(Y)^T (W - Y)) + \frac{\beta}{2} \|W - Y\|_F^2$

initialize: $t_1 = 1$, $\beta = 1$, $Y^1 = W^0$, and $i = 0$

repeat
    $i = i + 1$, $W^i = \mathrm{Proj}(Y^i - \frac{1}{\beta} \nabla g(Y^i))$
    while $g(W^i) > \tilde{g}_{\beta,Y^i}(W^i)$ do
        $\beta = 2\beta$, $W^i = \mathrm{Proj}(Y^i - \frac{1}{\beta} \nabla g(Y^i))$
    end while
    $t_{i+1} = \frac{1 + \sqrt{1 + 4 t_i^2}}{2}$, $Y^{i+1} = W^i + \frac{t_i - 1}{t_{i+1}} (W^i - W^{i-1})$
until convergence



3.1 Smooth Local Linear Weighting 

Lemma 1 bounds the approximation error by two terms. The first term states that $x$ should be close
to its linear approximation, and the second that the weighting should be local. In addition we want
the local metrics to vary smoothly over the data manifold. To achieve this smoothness we rely
on manifold regularization and constrain the weight vectors of neighboring instances to be similar.
Following this reasoning we will learn Smooth Local Linear Weights for the basis metrics by
minimizing the error bound of (1) together with a regularization term that controls the weight variation
of similar instances. To simplify the objective function, we use the term $\|x - \sum_{u \in U} \gamma_u(x) u\|^2$
instead of $\|x - \sum_{u \in U} \gamma_u(x) u\|$. By including the constraints on the $W$ weight matrix in (2), the
optimization problem is given by:

$$\min_{W} \; g(W) = \|X - WU\|_F^2 + \lambda_1 \mathrm{tr}(WG) + \lambda_2 \mathrm{tr}(W^T L W) \tag{4}$$
$$\text{s.t.} \quad W_{ib_k} \ge 0, \quad \sum_{b_k} W_{ib_k} = 1, \quad \forall i, b_k$$

where $\mathrm{tr}(\cdot)$ and $\|\cdot\|_F$ denote respectively the trace of a square matrix and the Frobenius norm
of a matrix. The $m \times n$ matrix $G$ is the squared distance matrix between each anchor point $u_i$ and
each instance $x_j$, obtained for $p = 1$ in (1), i.e. its $(i, j)$-th entry is the squared Euclidean distance
between $u_i$ and $x_j$. $L$ is the $n \times n$ Laplacian matrix constructed by $D - S$, where $S$ is the $n \times n$
symmetric pairwise similarity matrix of learning instances and $D$ is a diagonal matrix with $D_{ii} = \sum_k S_{ik}$.
Thus the minimization of the $\mathrm{tr}(W^T L W)$ term constrains similar instances to have similar
weight coefficients. The minimization of the $\mathrm{tr}(WG)$ term forces the weights of the instances
to reflect their local properties. Most often the similarity matrix $S$ is constructed using the $k$-nearest
neighbors graph [19]. The $\lambda_1$ and $\lambda_2$ parameters control the importance of the different terms.

Since the cost function $g(W)$ is convex quadratic in $W$ and the constraint is simply linear, (4) is
a convex optimization problem with a unique optimal solution. The constraints on $W$ in (4) can be
seen as $n$ simplex constraints, one on each row of $W$; we will use the projected gradient method to solve
the optimization problem. At each iteration $t$, the learned weight matrix $W$ is updated by:

$$W^{t+1} = \mathrm{Proj}(W^t - \eta \nabla g(W^t)) \tag{5}$$

where $\eta > 0$ is the step size and $\nabla g(W^t)$ is the gradient of the cost function $g(W)$ at $W^t$. The
$\mathrm{Proj}(\cdot)$ denotes the simplex projection operator on each row of $W$. Such a projection operator can
be efficiently implemented with a complexity of $O(nm \log(m))$ [6]. To speed up the optimization
procedure we employ a fast first-order optimization method, FISTA [2]. The detailed algorithm is
described in Algorithm 1. The Lipschitz constant $\beta$ required by this algorithm is estimated by using
the condition $g(W^i) \le \tilde{g}_{\beta,Y^i}(W^i)$ [2]. At each iteration, the main computations are in the
gradient and the objective value with complexity $O(nmd + n^2 m)$.
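
The weight learning step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the row-wise simplex projection uses the usual sort-based procedure, the gradient $2(WU - X)U^T + \lambda_1 G^T + 2\lambda_2 L W$ is derived here from (4) for a symmetric $L$, and the toy data, the choice of anchors, the 3-nearest-neighbor graph, the stopping rule and $\lambda_2 = 1$ are all illustrative choices.

```python
import numpy as np

def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, m = V.shape
    S = -np.sort(-V, axis=1)                    # each row sorted in decreasing order
    css = np.cumsum(S, axis=1) - 1.0            # cumulative sums minus the target total
    cond = S - css / np.arange(1, m + 1) > 0    # holds exactly for the first rho entries
    rho = cond.sum(axis=1)
    theta = css[np.arange(n), rho - 1] / rho    # per-row shift
    return np.maximum(V - theta[:, None], 0.0)

def objective(W, X, U, G, L, lam1, lam2):
    """g(W) = ||X - WU||_F^2 + lam1 tr(WG) + lam2 tr(W^T L W)."""
    R = X - W @ U
    return np.sum(R * R) + lam1 * np.sum(W * G.T) + lam2 * np.sum(W * (L @ W))

def gradient(W, X, U, G, L, lam1, lam2):
    """Gradient of g, assuming L is symmetric."""
    return 2.0 * (W @ U - X) @ U.T + lam1 * G.T + 2.0 * lam2 * (L @ W)

def fista_weights(W0, X, U, G, L, lam1=1.0, lam2=100.0, n_iter=500, tol=1e-9):
    """FISTA with backtracking on beta, following the structure of Algorithm 1."""
    args = (X, U, G, L, lam1, lam2)
    W_prev, Y, t, beta = W0.copy(), W0.copy(), 1.0, 1.0
    for _ in range(n_iter):
        g_Y, grad_Y = objective(Y, *args), gradient(Y, *args)
        while True:  # double beta until the quadratic model upper-bounds g at W
            W = project_rows_to_simplex(Y - grad_Y / beta)
            D = W - Y
            model = g_Y + np.sum(grad_Y * D) + 0.5 * beta * np.sum(D * D)
            if objective(W, *args) <= model + 1e-12 * max(1.0, abs(g_Y)):
                break
            beta *= 2.0
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Y = W + ((t - 1.0) / t_next) * (W - W_prev)
        converged = np.linalg.norm(W - W_prev) <= tol
        W_prev, t = W, t_next
        if converged:
            break
    return W_prev

# Toy problem (all sizes and data are hypothetical).
rng = np.random.default_rng(0)
n, m, d = 30, 4, 2
X = rng.normal(size=(n, d))
U = X[rng.choice(n, size=m, replace=False)]             # stand-in for k-means centres
G = ((U[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # (m, n) squared distances
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
nn = np.argsort(D2, axis=1)[:, 1:4]                     # 3 nearest neighbours of each row
S = np.zeros((n, n))
S[np.repeat(np.arange(n), 3), nn.ravel()] = 1.0
S = np.maximum(S, S.T)                                  # symmetric similarity matrix
L = np.diag(S.sum(axis=1)) - S                          # graph Laplacian D - S
W0 = np.full((n, m), 1.0 / m)
W = fista_weights(W0, X, U, G, L, lam1=1.0, lam2=1.0)
```

The backtracking loop is one way to realize the Lipschitz constant estimation mentioned above: `beta` is doubled until the quadratic model at `Y` upper-bounds the objective at the projected point.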

To set the weights of the basis metrics for a testing instance we can optimize (4) given the weight of
the basis metrics for the training instances. Alternatively we can simply set them as the weights of 
its nearest neighbor in the training instances. In the experiments we used the latter approach. 
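
The second option, copying the weights of the nearest training instance, fits in a few lines. The text does not say under which distance the nearest neighbor is found; the sketch assumes the Euclidean distance, and the function name and toy arrays are hypothetical.

```python
import numpy as np

def weights_for_test(X_train, W_train, X_test):
    """Give each test instance the weight vector of its nearest training instance.

    Assumption: the nearest neighbour is searched with the Euclidean distance.
    """
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)   # (n_test, n_train)
    return W_train[np.argmin(d2, axis=1)]

X_train = np.array([[0.0, 0.0], [10.0, 10.0]])
W_train = np.array([[0.9, 0.1], [0.2, 0.8]])
X_test = np.array([[1.0, -1.0], [9.0, 11.0], [0.5, 0.5]])
W_test = weights_for_test(X_train, W_train, X_test)
```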

3.2 Large Margin Basis Metric Learning

In this section we define a large margin based algorithm to learn the basis metrics $M_{b_1}, \ldots, M_{b_m}$.
Given the $W$ weight matrix of basis metrics obtained using Algorithm 1, the local metric $M_i$ of
an instance $x_i$ defined in (2) is linear with respect to the basis metrics $M_{b_1}, \ldots, M_{b_m}$. We define
the relative comparison distance of instances $x_i$, $x_j$ and $x_k$ as: $d^2_{M_i}(x_i, x_k) - d^2_{M_i}(x_i, x_j)$. In
a large margin constraint $c(x_i, x_j, x_k)$, the squared distance $d^2_{M_i}(x_i, x_k)$ is required to be larger
than $d^2_{M_i}(x_i, x_j) + 1$, otherwise an error $\xi_{ijk} \ge 0$ is generated. Note that this relative comparison
definition is different from that defined in LMNN-MM [15]. In LMNN-MM, to avoid over-fitting,
different local metrics $M_j$ and $M_k$ are used to compute the squared distances $d^2_{M_j}(x_i, x_j)$ and
$d^2_{M_k}(x_i, x_k)$ respectively, as no smoothness constraint is added between metrics of different local
regions.

Given a set of triplet constraints, we learn the basis metrics $M_{b_1}, \ldots, M_{b_m}$ with the following
optimization problem:

$$\min_{M_{b_1}, \ldots, M_{b_m}, \xi} \; \alpha_1 \sum_{b_l} \|M_{b_l}\|_F^2 + \sum_{ijk} \xi_{ijk} + \alpha_2 \sum_{ij} \sum_{b_l} W_{ib_l}\, d^2_{M_{b_l}}(x_i, x_j) \tag{6}$$
$$\text{s.t.} \quad \sum_{b_l} W_{ib_l} \left( d^2_{M_{b_l}}(x_i, x_k) - d^2_{M_{b_l}}(x_i, x_j) \right) \ge 1 - \xi_{ijk} \quad \forall i, j, k$$
$$\xi_{ijk} \ge 0 \;\; \forall i, j, k; \qquad M_{b_l} \succeq 0 \;\; \forall b_l$$

where $\alpha_1$ and $\alpha_2$ are parameters that balance the importance of the different terms. The large margin
triplet constraints for each instance are generated using its $k_1$ same class nearest neighbors and $k_2$
different class nearest neighbors by requiring its distances to the $k_2$ different class instances to be
larger than those to its $k_1$ same class instances. In the objective function of (6) the basis metrics are
learned by minimizing the sum of large margin errors and the sum of squared pairwise distances of
each instance to its $k_1$ nearest neighbors computed using the local metric. Unlike LMNN we add the
squared Frobenius norm of each basis metric to the objective function. We do this for two reasons.
First, we exploit the connection between LMNN and SVM shown in [?], under which the squared
Frobenius norm of the metric matrix is related to the SVM margin. Second, adding this term
leads to an easy-to-optimize dual formulation of (6) [12].

Unlike many special solvers which optimize the primal form of the metric learning problem [15, 13],
we follow [12] and optimize the Lagrangian dual problem of (6). The dual formulation leads to an
efficient basis metric learning algorithm. Introducing the Lagrangian dual multipliers $\gamma_{ijk}$, $p_{ijk}$ and
the PSD matrices $Z_{b_l}$ to respectively associate with every large margin triplet constraint, $\xi_{ijk} \ge 0$
and the PSD constraints $M_{b_l} \succeq 0$ in (6), we can easily derive the following Lagrangian dual form

$$\max_{Z_{b_1}, \ldots, Z_{b_m}, \gamma} \; \sum_{ijk} \gamma_{ijk} - \sum_{b_l} \frac{1}{4\alpha_1} \Big\| Z_{b_l} + \sum_{ijk} \gamma_{ijk} W_{ib_l} C_{ijk} - \alpha_2 \sum_{ij} W_{ib_l} A_{ij} \Big\|_F^2 \tag{7}$$
$$\text{s.t.} \quad 1 \ge \gamma_{ijk} \ge 0 \;\; \forall i, j, k; \qquad Z_{b_l} \succeq 0 \;\; \forall b_l$$

and the corresponding optimality conditions:
$M^*_{b_l} = \frac{1}{2\alpha_1} \big( Z^*_{b_l} + \sum_{ijk} \gamma^*_{ijk} W_{ib_l} C_{ijk} - \alpha_2 \sum_{ij} W_{ib_l} A_{ij} \big)$ and
$1 \ge \gamma_{ijk} \ge 0$, where the matrices $A_{ij}$ and $C_{ijk}$ are given by $x_{ij} x_{ij}^T$ and $x_{ik} x_{ik}^T - x_{ij} x_{ij}^T$ respectively,
where $x_{ij} = x_i - x_j$.

Compared to the primal form, the main advantage of the dual formulation is that the second term
in the objective function of (7) has a closed-form solution for $Z_{b_l}$ given a fixed $\gamma$. To derive the
optimal solution of $Z_{b_l}$, let $K_{b_l} = \alpha_2 \sum_{ij} W_{ib_l} A_{ij} - \sum_{ijk} \gamma_{ijk} W_{ib_l} C_{ijk}$. Then, given a fixed $\gamma$,
the optimal solution of $Z_{b_l}$ is $Z^*_{b_l} = (K_{b_l})_+$, where $(K_{b_l})_+$ projects the matrix $K_{b_l}$ onto the PSD
cone, i.e. $(K_{b_l})_+ = U [\max(\mathrm{diag}(\Sigma), 0)] U^T$ with $K_{b_l} = U \Sigma U^T$.

Now, (7) is rewritten as:

$$\min_{\gamma} \; g(\gamma) = -\sum_{ijk} \gamma_{ijk} + \sum_{b_l} \frac{1}{4\alpha_1} \big\| (K_{b_l})_+ - K_{b_l} \big\|_F^2 \tag{8}$$
$$\text{s.t.} \quad 1 \ge \gamma_{ijk} \ge 0 \;\; \forall i, j, k$$

And the optimality condition for $M_{b_l}$ is $M^*_{b_l} = \frac{1}{2\alpha_1} \big( (K^*_{b_l})_+ - K^*_{b_l} \big)$. The gradient of the objective
function in (8), $\nabla g(\gamma_{ijk})$, is given by: $\nabla g(\gamma_{ijk}) = -1 + \sum_{b_l} \frac{1}{2\alpha_1} \mathrm{tr}\big( ((K_{b_l})_+ - K_{b_l}) W_{ib_l} C_{ijk} \big)$. At
each iteration, $\gamma$ is updated by:

$$\gamma^{i+1} = \mathrm{BoxProj}(\gamma^i - \eta \nabla g(\gamma^i)) \tag{9}$$

where $\eta > 0$ is the step size. The $\mathrm{BoxProj}(\cdot)$ denotes the simple box projection operator on $\gamma$
as specified in the constraints of (8). At each iteration, the main computational complexity lies
in the computation of the eigen-decomposition with a complexity of $O(md^3)$ and the computation
of the gradient with a complexity of $O(m(nd^2 + cd))$, where $m$ is the number of basis metrics
and $c$ is the number of large margin triplet constraints. As in the weight learning problem, the FISTA
algorithm is employed to accelerate the optimization process; for lack of space we omit the algorithm
presentation.
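
One plain projected-gradient update of the dual variables, as in (9), can be sketched as below. This is an illustrative reading of the formulas in this section and not the authors' code: the FISTA acceleration is left out, the pair and triplet index lists are given by hand instead of being built from class labels, and all names and toy values are assumptions.

```python
import numpy as np

def psd_part(K):
    """Projection of a symmetric matrix onto the PSD cone, (K)_+."""
    evals, evecs = np.linalg.eigh(K)
    return (evecs * np.maximum(evals, 0.0)) @ evecs.T

def build_K(gamma, triplets, pairs, X, W, alpha2):
    """K_b = alpha2 * sum_ij W_ib A_ij - sum_ijk gamma_ijk W_ib C_ijk for every basis b."""
    m, d = W.shape[1], X.shape[1]
    K = np.zeros((m, d, d))
    for i, j in pairs:
        x_ij = X[i] - X[j]
        K += alpha2 * W[i][:, None, None] * np.outer(x_ij, x_ij)       # A_ij term
    C = []
    for g, (i, j, k) in zip(gamma, triplets):
        x_ij, x_ik = X[i] - X[j], X[i] - X[k]
        C.append(np.outer(x_ik, x_ik) - np.outer(x_ij, x_ij))          # C_ijk
        K -= g * W[i][:, None, None] * C[-1]
    return K, C

def dual_objective(gamma, triplets, pairs, X, W, alpha1, alpha2):
    """g(gamma) of equation (8)."""
    K, _ = build_K(gamma, triplets, pairs, X, W, alpha2)
    penalty = sum(np.sum((psd_part(Kb) - Kb) ** 2) for Kb in K)
    return -np.sum(gamma) + penalty / (4.0 * alpha1)

def dual_gradient(gamma, triplets, pairs, X, W, alpha1, alpha2):
    """Gradient of (8) and the basis metrics M_b = ((K_b)_+ - K_b) / (2 alpha1)."""
    K, C = build_K(gamma, triplets, pairs, X, W, alpha2)
    M = np.stack([(psd_part(Kb) - Kb) / (2.0 * alpha1) for Kb in K])
    grad = np.array([-1.0 + np.einsum('b,bpq,pq->', W[i], M, C_t)
                     for (i, _, _), C_t in zip(triplets, C)])
    return grad, M

def dual_step(gamma, eta, *args):
    """One update of equation (9): gradient step followed by clipping to [0, 1]."""
    grad, M = dual_gradient(gamma, *args)
    return np.clip(gamma - eta * grad, 0.0, 1.0), M

# Toy data (hypothetical): 6 instances, 2 basis metrics, 2 pairs, 2 triplets.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
W = np.array([[0.7, 0.3]] * 3 + [[0.2, 0.8]] * 3)
pairs = [(0, 1), (2, 3)]
triplets = [(0, 1, 4), (2, 3, 5)]
gamma_new, M_basis = dual_step(np.array([0.5, 0.5]), 0.1,
                               triplets, pairs, X, W, 1.0, 1.0)
```

Since $(K)_+ - K$ is the PSD projection of $-K$, the basis metrics recovered this way are PSD by construction, and the eigen-decomposition inside `psd_part` is the $O(md^3)$ cost mentioned above.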

4 Experiments 

In this section we will evaluate the performance of PLML and compare it with a number of
relevant baseline methods on six datasets with a large number of instances, ranging from 5K to 70K
instances; these datasets are Letter, USPS, Pendigits, Optdigits, Isolet and MNIST. We want to
determine whether the addition of manifold regularization on the local metrics improves the predictive
performance of local metric learning, and whether local metric learning improves over learning
with a single global metric. We will compare PLML against six baseline methods. The first, SML, is
a variant of PLML where a single global metric is learned, i.e. we set the number of basis metrics in (6) to
one. The second, Cluster-Based LML (CBLML), is also a variant of PLML without weight
learning. Here we learn one local metric for each cluster and we assign a weight of one for a basis metric
$M_{b_i}$ if the corresponding cluster of $M_{b_i}$ contains the instance, and zero otherwise. Finally, we
also compare against four state of the art metric learning methods: LMNN [15], BoostMetric [13]¹,
GLML [11] and LMNN-MM [15]². The former two learn a single global metric and the latter two
a number of local metrics. In addition to the different metric learning methods, we also compare
PLML against multi-class SVMs in which we use the one-against-all strategy to determine the class
label for multi-class problems and select the best kernel with inner cross validation.

Since metric learning is computationally expensive for datasets with a large number of features, we
followed [15] and reduced the dimensionality of the USPS, Isolet and MNIST datasets by applying
PCA. In these datasets the retained PCA components explain 95% of the total variance. We
preprocessed all datasets by first standardizing the input features, and then normalizing the instances
so that their L2-norm is one.

PLML has a number of hyper-parameters. To reduce the computational time we do not tune $\lambda_1$
and $\lambda_2$ of the weight learning optimization problem (4), and we set them to their default values of
$\lambda_1 = 1$ and $\lambda_2 = 100$. The Laplacian matrix $L$ is constructed using the six nearest neighbors graph
following [19]. The anchor points $U$ are the means of clusters constructed with k-means clustering.
The number $m$ of anchor points, i.e. the number of basis metrics, depends on the complexity of
the learning problem. More complex problems will often require a larger number of anchor points
to better model the complexity of the data. As the number of classes in the examined datasets is
10 or 26, we simply set $m = 20$ for all datasets. In the basis metric learning problem (6), the
number of the dual parameters $\gamma$ is the same as the number of triplet constraints. To speed up the
learning process, the triplet constraints are constructed only using the three same-class and the three
different-class nearest neighbors for each learning instance. The parameter $\alpha_2$ is set to 1, while
the parameter $\alpha_1$ is the only parameter that we select from the set {0.01, 0.1, 1, 10, 100} using
2-fold inner cross-validation. The above setting of basis metric learning for PLML is also used
with the SML and CBLML methods. For LMNN and LMNN-MM we use their default settings
[15], in which the triplet constraints are constructed by the three nearest same-class neighbors and
all different-class samples. As a result, the number of triplet constraints optimized in LMNN and
LMNN-MM is much larger than those of PLML, SML, BoostMetric and CBLML. The local metrics
are initialized by identity matrices. As in [11], GLML uses the Gaussian distribution to model the
learning instances from the same class. Finally, we use the 1-NN rule to evaluate the performance

1 http://code.google.com/p/boosting

2 http://www.cse.wustl.edu/~kilian/code/code.html

Figure 1: The visualization of learned local metrics of (a) LMNN-MM, (b) CBLML, (c) GLML and (d) PLML.

Table 1: Accuracy results. The superscripts +, -, = next to the accuracies of PLML indicate the result
of McNemar's statistical test with LMNN, BoostMetric, SML, CBLML, LMNN-MM, GLML
and SVM, in that order. They denote respectively a significant win, loss or no difference for PLML. Bold
indicates that an algorithm has a performance that is not significantly different from that of the best algorithm.
The number in parentheses indicates the score of the respective algorithm for the given dataset
based on the pairwise comparisons of McNemar's statistical test.

                                     Single Metric Learning Baselines       Local Metric Learning Baselines
Datasets     PLML                    LMNN         BoostMetric  SML          CBLML        LMNN-MM      GLML         SVM
Letter       97.22 +++|+++|+ (7.0)   96.08 (2.5)  96.49 (4.5)  96.71 (5.5)  95.82 (2.5)  95.02 (1.0)  93.86 (0.0)  96.64 (5.0)
Pendigits    98.34 +++|+++|+ (7.0)   97.43 (2.0)  97.43 (2.5)  97.80 (4.5)  97.94 (5.0)  97.43 (2.0)  96.88 (0.0)  97.91 (5.0)
Optdigits    97.72 ===|+++|= (5.0)   97.55 (5.0)  97.61 (5.0)  97.22 (5.0)  95.94 (1.5)  95.94 (1.5)  94.82 (0.0)  97.33 (5.0)
Isolet       95.25 =+=|+++|= (5.5)   95.51 (5.5)  89.16 (2.5)  94.68 (5.5)  89.03 (2.5)  84.61 (0.5)  84.03 (0.5)  95.19 (5.5)
USPS         98.26 +++|+++|= (6.5)   97.92 (4.5)  97.65 (2.5)  97.94 (4.0)  96.22 (0.5)  97.90 (4.0)  96.05 (0.5)  98.19 (5.5)
MNIST        97.30 =++|+++|= (6.0)   97.30 (6.0)  96.03 (2.5)  96.57 (4.0)  95.77 (2.5)  93.24 (1.0)  84.02 (0.0)  97.62 (6.0)
Total Score  37                      25.5         19.5         28.5         14.5         10           1            32.5


of the different metric learning methods. In addition as we already mentioned we also compare 
against multi-class SVM. Since the performance of the latter depends heavily on the kernel with 
which it is coupled we do automatic kernel selection with inner cross validation to select the best 
kernel and parameter setting. The kernels were chosen from the set of linear, polynomial (degree 2,3 
and 4), and Gaussian kernels; the width of the Gaussian kernel was set to the average of all pairwise 
distances. Its C parameter of the hinge loss term was selected from {0.1, 1, 10, 100}. 

To estimate the classification accuracy for Pendigits, Optdigits, Isolet and MNIST we used the
default train and test split; for the other datasets we used 10-fold cross-validation. The statistical
significance of the differences was tested with McNemar's test at a significance level of 0.05. In order to
get a better understanding of the relative performance of the different algorithms for a given dataset
we used a simple ranking schema in which an algorithm A was assigned one point if it was found
to have a statistically significantly better accuracy than another algorithm B, 0.5 points if the two
algorithms did not have a significant difference, and zero points if A was found to be significantly
worse than B.
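
The ranking schema can be written down directly. The sketch below assumes the chi-square form of McNemar's test with continuity correction, which is one common variant; the text does not state which variant was used. The function names and the toy prediction vectors are hypothetical.

```python
import math

def mcnemar_p(b, c):
    """p-value of McNemar's chi-square test (1 dof) with continuity correction.

    b: cases where A is correct and B is wrong; c: cases where B is correct and A is wrong.
    """
    if b + c == 0:
        return 1.0
    stat = max(abs(b - c) - 1.0, 0.0) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2.0))   # survival function of chi-square with 1 dof

def pairwise_points(correct_a, correct_b, alpha=0.05):
    """Points earned by A against B: 1 (significant win), 0.5 (no difference), 0 (loss)."""
    b = sum(a and not o for a, o in zip(correct_a, correct_b))
    c = sum(o and not a for a, o in zip(correct_a, correct_b))
    if mcnemar_p(b, c) >= alpha:
        return 0.5
    return 1.0 if b > c else 0.0

# Toy per-instance correctness of two classifiers on 100 test instances.
correct_a = [True] * 30 + [False] * 5 + [True] * 65
correct_b = [False] * 30 + [True] * 5 + [True] * 65
points_a = pairwise_points(correct_a, correct_b)
points_b = pairwise_points(correct_b, correct_a)
```

Summing these points over all opponents gives the per-dataset score of an algorithm; with eight algorithms the maximum is 7.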

4.1 Results 

In Table 1 we report the experimental results. PLML consistently outperforms the single global
metric learning methods LMNN, BoostMetric and SML, for all datasets except Isolet on which
its accuracy is slightly lower than that of LMNN. Depending on the single global metric learning
method with which we compare it, it is significantly better in three, four, and five datasets (for
LMNN, SML, and BoostMetric respectively) out of the six, and never significantly worse. When
we compare PLML with CBLML and LMNN-MM, the two baseline methods which learn one local
metric for each cluster and each class respectively with no smoothness constraints, we see that it is
statistically significantly better in all the datasets. GLML fails to learn appropriate metrics on all
datasets because its fundamental generative model assumption is often not valid. Finally, we see
that PLML is significantly better than SVM in two out of the six datasets and it is never significantly
worse; remember here that with SVM we also do inner fold kernel selection to automatically select
the appropriate feature space. Overall PLML is the best performing method, scoring 37 points over
the different datasets, followed by SVM with automatic kernel selection and SML which score 32.5
and 28.5 points respectively. The other metric learning methods perform rather poorly.

Examining more closely the performance of the baseline local metric learning methods CBLML and
LMNN-MM we observe that they tend to overfit the learning problems. This can be seen by their
considerably worse performance with respect to that of SML and LMNN which rely on a single
global model. On the other hand PLML, even though it also learns local metrics, does not suffer
from the overfitting problem due to the manifold regularization. The poor performance of
LMNN-MM is not in agreement with the results reported in [15]. The main reason for the difference is the
experimental setting. In [15], 30% of the training instances of each dataset were used as a validation
set to avoid overfitting.

Figure 2: Accuracy results of PLML and CBLML with varying number of basis metrics. Panels: (a) Letter, (b) Pendigits, (c) Optdigits, (d) USPS, (e) Isolet, (f) MNIST.


To provide a better understanding of the behavior of the learned metrics, we applied PLML,
LMNN-MM, CBLML and GLML on an image dataset containing instances of four different handwritten
digits, zero, one, two, and four, from the MNIST dataset. As in [15], we use the two main principal
components to learn. Figure 1 shows the learned local metrics by plotting the axes of their
corresponding ellipses (black line). The direction of the longer axis is the more discriminative. Clearly
PLML fits the data much better than LMNN-MM and, as expected, its local metrics vary smoothly.
In terms of predictive performance, PLML is the best with 82.76% accuracy. CBLML,
LMNN-MM and GLML have an almost identical performance with respective accuracies of 82.59%,
82.56% and 82.51%.

Finally we investigated the sensitivity of PLML and CBLML to the number of basis metrics; we
experimented with $m \in \{5, 10, 15, 20, 25, 30, 35, 40\}$. The results are given in Figure 2. We see
that the predictive performance of PLML often improves as we increase the number of the basis
metrics. Its performance saturates when the number of basis metrics becomes sufficient to model the
underlying training data. As expected, different learning problems require different numbers of basis
metrics. PLML does not overfit on any of the datasets. In contrast, the performance of CBLML gets
worse when the number of basis metrics is large, which provides further evidence that CBLML does
indeed overfit the learning problems, demonstrating clearly the utility of the manifold regularization.



5 Conclusions 

Local metric learning provides a more flexible way to learn the distance function. However, local
metric learning methods are prone to overfitting since the number of parameters they learn can be very large. In this paper we
presented PLML, a local metric learning method which regularizes local metrics to vary smoothly
over the data manifold. Using an approximation error bound of the metric matrix function, we
parametrize the local metrics by weighted linear combinations of local metrics of anchor points.
Our method scales to learning problems with tens of thousands of instances and avoids the overfitting
problems that plague the other local metric learning methods. The experimental results show that
PLML significantly outperforms the state of the art metric learning methods and that its performance
is significantly better than or equivalent to that of SVM with automatic kernel selection.